Speech enhancement using clustering of cues

ABSTRACT

A method for speech enhancement, the method may include receiving or generating sound samples that represent sound signals that were received during a given time period by an array of microphones; frequency transforming the sound samples to provide frequency-transformed samples; clustering the frequency-transformed samples to speakers to provide speaker related clusters, wherein the clustering is based on (i) spatial cues related to the received sound signals and (ii) acoustic cues related to the speakers; determining a relative transfer function for each speaker of the speakers to provide speakers related relative transfer functions; applying a multiple input multiple output (MIMO) beamforming operation on the speakers related relative transfer functions to provide beamformed signals; and inverse-frequency transforming the beamformed signals to provide speech signals.

BACKGROUND

The performance of speech enhancement modules depends upon the ability to filter out all the interference signals, leaving only the desired speech signals. Interference signals might be, for example, other speakers, noise from air conditioners, music, motor noise (e.g. in a car or airplane) and large crowd noise, also known as ‘cocktail party noise’. The performance of speech enhancement modules is normally measured by their ability to improve the speech-to-noise ratio (SNR) or the speech-to-interference ratio (SIR), which reflect the ratio (often in dB scale) of the power of the desired speech signal to the total power of the noise and of other interfering signals, respectively.

There is a growing need to perform speech enhancement in a reverberant environment.

SUMMARY

There may be provided a method for speech enhancement, the method may include: receiving or generating sound samples that represent sound signals that were received during a given time period by an array of microphones; frequency transforming the sound samples to provide frequency-transformed samples; clustering the frequency-transformed samples to speakers to provide speaker related clusters, wherein the clustering may be based on (i) spatial cues related to the received sound signals and (ii) acoustic cues related to the speakers; determining a relative transfer function for each speaker of the speakers to provide speakers related relative transfer functions; applying a multiple input multiple output (MIMO) beamforming operation on the speakers related relative transfer functions to provide beamformed signals; and inverse-frequency transforming the beamformed signals to provide speech signals.

The method may include generating the acoustic cues related to the speakers.

The generating of the acoustic cues may include searching for a keyword in the sound samples; and extracting the acoustic cues from the keyword.

The method may include extracting spatial cues related to the keyword.

The method may include using the spatial cues related to the keyword as a clustering seed.

The acoustic cues may include pitch frequency, pitch intensity, one or more pitch frequency harmonics, and intensity of the one or more pitch frequency harmonics.

The method may include associating a reliability attribute to each pitch and determining that a speaker that may be associated with the pitch may be silent when a reliability of the pitch falls below a predefined threshold.

The clustering may include processing the frequency-transformed samples to provide the acoustic cues and the spatial cues; tracking over time states of speakers using the acoustic cues; segmenting the spatial cues of each frequency component of the frequency-transformed signals to groups; and assigning to each group of frequency-transformed signals an acoustic cue related to a currently active speaker.

The assigning may include calculating, for each group of frequency-transformed signals, a cross-correlation between elements of equal-frequency lines of a time frequency map with elements that belong to other lines of the time frequency map and may be related to the group of frequency-transformed signals.

The tracking may include applying an extended Kalman filter.

The tracking may include applying multiple hypothesis tracking.

The tracking may include applying a particle filter.

The segmenting may include assigning a single frequency component related to a single time frame to a single speaker.

The method may include monitoring at least one monitored acoustic feature out of speech speed, speech intensity and emotional utterances.

The method may include feeding the at least one monitored acoustic feature to an extended Kalman filter.

The frequency-transformed samples may be arranged in multiple vectors, one vector per each microphone of the array of microphones; wherein the method may include calculating an intermediate vector by weight averaging the multiple vectors; and searching for acoustic cue candidates by ignoring elements of the intermediate vector that have a value that may be lower than a predefined threshold.

The method may include determining the predefined threshold to be three times a standard deviation of a noise.

There may be provided a non-transitory computer readable medium that stores instructions that once executed by a computerized system cause the computerized system to: receive or generate sound samples that represent sound signals that were received during a given time period by an array of microphones; frequency transform the sound samples to provide frequency-transformed samples; cluster the frequency-transformed samples to speakers to provide speaker related clusters, wherein the clustering may be based on (i) spatial cues related to the received sound signals and (ii) acoustic cues related to the speakers; determine a relative transfer function for each speaker of the speakers to provide speakers related relative transfer functions; apply a multiple input multiple output (MIMO) beamforming operation on the speakers related relative transfer functions to provide beamformed signals; and inverse-frequency transform the beamformed signals to provide speech signals.

The non-transitory computer readable medium may store instructions for generating the acoustic cues related to the speakers.

The generating of the acoustic cues may include searching for a keyword in the sound samples; and extracting the acoustic cues from the keyword.

The non-transitory computer readable medium may store instructions for extracting spatial cues related to the keyword.

The non-transitory computer readable medium may store instructions for using the spatial cues related to the keyword as a clustering seed.

The acoustic cues may include pitch frequency, pitch intensity, one or more pitch frequency harmonics, and intensity of the one or more pitch frequency harmonics.

The non-transitory computer readable medium may store instructions for associating a reliability attribute to each pitch and determining that a speaker that may be associated with the pitch may be silent when a reliability of the pitch falls below a predefined threshold.

The clustering may include processing the frequency-transformed samples to provide the acoustic cues and the spatial cues; tracking over time states of speakers using the acoustic cues; segmenting the spatial cues of each frequency component of the frequency-transformed signals to groups; and assigning to each group of frequency-transformed signals an acoustic cue related to a currently active speaker.

The assigning may include calculating, for each group of frequency-transformed signals, a cross-correlation between elements of equal-frequency lines of a time frequency map with elements that belong to other lines of the time frequency map and may be related to the group of frequency-transformed signals.

The tracking may include applying an extended Kalman filter.

The tracking may include applying multiple hypothesis tracking.

The tracking may include applying a particle filter.

The segmenting may include assigning a single frequency component related to a single time frame to a single speaker.

The non-transitory computer readable medium may store instructions for monitoring at least one monitored acoustic feature out of speech speed, speech intensity and emotional utterances.

The non-transitory computer readable medium may store instructions for feeding the at least one monitored acoustic feature to an extended Kalman filter.

The frequency-transformed samples may be arranged in multiple vectors, one vector per each microphone of the array of microphones; wherein the non-transitory computer readable medium may store instructions for calculating an intermediate vector by weight averaging the multiple vectors; and searching for acoustic cue candidates by ignoring elements of the intermediate vector that have a value that may be lower than a predefined threshold.

The non-transitory computer readable medium may store instructions for determining the predefined threshold to be three times a standard deviation of a noise.

There may be provided a computerized system that may include an array of microphones, a memory unit and a processor. The processor may be configured to receive or generate sound samples that represent sound signals that were received during a given time period by an array of microphones; frequency transform the sound samples to provide frequency-transformed samples; cluster the frequency-transformed samples to speakers to provide speaker related clusters, wherein the clustering may be based on (i) spatial cues related to the received sound signals and (ii) acoustic cues related to the speakers; determine a relative transfer function for each speaker of the speakers to provide speakers related relative transfer functions; apply a multiple input multiple output (MIMO) beamforming operation on the speakers related relative transfer functions to provide beamformed signals; inverse-frequency transform the beamformed signals to provide speech signals; and wherein the memory unit may be configured to store at least one of the sound samples and the speech signals.

The computerized system may not include the array of microphones but may receive signals from the array of microphones that represent the sound signals that were received during the given time period by the array of microphones.

The processor may be configured to generate the acoustic cues related to the speakers.

The generating of the acoustic cues may include searching for a keyword in the sound samples; and extracting the acoustic cues from the keyword.

The processor may be configured to extract spatial cues related to the keyword.

The processor may be configured to use the spatial cues related to the keyword as a clustering seed.

The acoustic cues may include pitch frequency, pitch intensity, one or more pitch frequency harmonics, and intensity of the one or more pitch frequency harmonics.

The processor may be configured to associate a reliability attribute to each pitch and determine that a speaker that may be associated with the pitch may be silent when a reliability of the pitch falls below a predefined threshold.

The processor may be configured to cluster by processing the frequency-transformed samples to provide the acoustic cues and the spatial cues; track over time states of speakers using the acoustic cues; segment the spatial cues of each frequency component of the frequency-transformed signals to groups; and assign to each group of frequency-transformed signals an acoustic cue related to a currently active speaker.

The processor may be configured to assign by calculating, for each group of frequency-transformed signals, a cross-correlation between elements of equal-frequency lines of a time frequency map with elements that belong to other lines of the time frequency map and may be related to the group of frequency-transformed signals.

The processor may be configured to track by applying an extended Kalman filter.

The processor may be configured to track by applying multiple hypothesis tracking.

The processor may be configured to track by applying a particle filter.

The processor may be configured to segment by assigning a single frequency component related to a single time frame to a single speaker.

The processor may be configured to monitor at least one monitored acoustic feature out of speech speed, speech intensity and emotional utterances.

The processor may be configured to feed the at least one monitored acoustic feature to an extended Kalman filter.

The frequency-transformed samples may be arranged in multiple vectors, one vector per each microphone of the array of microphones; wherein the processor may be configured to calculate an intermediate vector by weight averaging the multiple vectors; and search for acoustic cue candidates by ignoring elements of the intermediate vector that have a value that may be lower than a predefined threshold.

The processor may be configured to determine the predefined threshold to be three times a standard deviation of a noise.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the invention and to see how it may be carried out in practice, a preferred embodiment will now be described, by way of non-limiting example only, with reference to the accompanying drawings.

FIG. 1 illustrates multipath;

FIG. 2 illustrates an example of a method;

FIG. 3 illustrates an example of a clustering step of the method of FIG. 2;

FIG. 4 illustrates an example of a pitch detection over a time-frequency map; and

FIG. 5 illustrates an example of a time-frequency-Cue map.

DETAILED DESCRIPTION OF THE DRAWINGS

Any reference to a system should be applied, mutatis mutandis, to a method that is executed by a system and/or to a non-transitory computer readable medium that stores instructions that once executed by the system will cause the system to execute the method.

Any reference to a method should be applied, mutatis mutandis, to a system that is configured to execute the method and/or to a non-transitory computer readable medium that stores instructions that once executed by the system will cause the system to execute the method.

Any reference to a non-transitory computer readable medium should be applied, mutatis mutandis, to a method that is executed by a system and/or a system that is configured to execute the instructions stored in the non-transitory computer readable medium.

The term “and/or” means additionally or alternatively.

The term “system” means a computerized system.

Speech enhancement methods are focused on extracting a speech signal from a desired source (speaker) when the signal is interfered by noise and other speakers. In a free-field environment, spatial filtering in the form of directional beamforming is effective. However, in a reverberant environment, the speech from each source is smeared across several directions, not necessarily successive, deteriorating the advantages of the ordinary beamformers. Using transfer-function (TF) based beamformers to address this issue, or using the relative transfer function (RTF) as the TF itself, is a promising direction. However, in multi-speaker environments, the ability to estimate the RTF for each speaker, when the speech signals are captured simultaneously, is still a challenge. There is provided a solution that involves tracking acoustic and spatial cues to cluster simultaneous speakers, thereby facilitating estimation of the RTF of the speakers in a reverberant environment.

There is provided a clustering algorithm of speakers which assigns each frequency component to its original speaker, especially in multi-speaker reverberant environments. This provides the necessary condition for the RTF estimator to work properly in multi-speaker reverberant environments. The estimate of the RTFs matrix is then used to compute the weight vector of the transfer function based linear constrained minimum variance (TF-LCMV) beamformer (see Equation (10) in the sequel) and thus satisfies the necessary condition for TF-LCMV to work. It is assumed that each human speaker is endowed with a different pitch, so that the pitch is a bijective indicator to a speaker. Multi-pitch detection is known to be a challenging task, especially in a noisy, reverberant multi-speaker environment. To address this challenge, the W-Disjoint Orthogonality (W-DO) assumption is employed, and a set of spatial cues, for example, signal intensity, azimuth angle and elevation angle, are used as additional features. The acoustical cues (pitch values) are tracked over time using an extended Kalman filter (EKF) to overcome temporary inactive speakers and changes in pitch, and the spatial cues are used to segment the last L frequency components and to assign each frequency component to different sources. The results of the EKF and the segmentation are combined by means of cross-correlation to facilitate the clustering of the frequency components to a specific speaker with a specific pitch.

FIG. 1 describes the paths along which the frequency components of the speech signal travel from a human speaker 11 to the microphone array 12 in a reverberant environment. The walls 13 and other elements in the environment 14 reflect the impinging signal with attenuation and reflecting angle which depend on the material and the texture of the wall. Different frequency components of the human speech might take different paths. These might be a direct path 15 which resides on the shortest path between the human speaker 11 and the microphone array 12, or indirect paths 16, 17. Note that a frequency component might travel along one or more paths.

FIG. 2 describes the algorithm. The signals are acquired by the microphone array 201 which contains M≥2 microphones, where M=7 microphones is one example. The microphones can be deployed in a range of constellations such as equally-spaced on a straight line, on a circle or on a sphere, or even unevenly spaced forming an arbitrary shape. The signal from each microphone is sampled, digitized, and stored in M frames, each containing T consecutive samples 202. The size of the frames T may be selected to be large enough such that the short-time Fourier transform (STFT) is accurate, but short enough so that the signal is stationary along the equivalent time duration. A typical value for T is 4,096 samples for a sampling rate of 16 kHz, that is, the frame is equivalent to approximately ¼ second. Often, consecutive frames overlap each other for improved tracking of the features of the signal over time. A typical overlap is 75%, that is, a new frame is initiated every 1,024 samples. T may, for example, correspond to a duration between 0.1 sec and 2 sec, thereby providing approximately 1024-32768 samples for a 16 kHz sampling rate. The samples are also referred to as sound samples that represent sound signals that were received by the array of microphones during a period of time T.
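The framing described above can be sketched in a few lines of Python. This is only an illustration under the typical values quoted in the text (T=4,096, 75% overlap, 16 kHz); the function name split_into_frames and the silent test signal are assumptions introduced for the example and are not part of the described method.

```python
def split_into_frames(samples, frame_len=4096, overlap=0.75):
    """Split one microphone's samples into overlapping frames of frame_len samples."""
    hop = int(round(frame_len * (1.0 - overlap)))  # 1,024 samples for the typical values
    if hop <= 0:
        raise ValueError("overlap must be below 1.0")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


sample_rate = 16000
signal = [0.0] * (2 * sample_rate)  # two seconds of silence as a stand-in signal
frames = split_into_frames(signal)
print(len(frames), len(frames[0]), 4096 / sample_rate)
```

In a multi-microphone setting the same function would be applied to each of the M channels, giving M frames per frame time-index.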

Each frame is transformed in 203 to the frequency domain by applying a Fourier transform or a variant of the Fourier transform such as short time Fourier transform (STFT), constant-Q transform (CQT), logarithmic Fourier transform (LFT), filter bank and the like. Several techniques such as windowing and zero-padding might be applied to control the framing effect. The result of 203 is M complex-valued vectors of length K. If, for example, the array includes 7 microphones, 7 vectors are prepared which are registered by the frame time-index ℓ. K is the number of frequency bins, and is determined by the frequency transform. For example, when using ordinary STFT, K=T which is the length of the buffer. The output of step 203 may be referred to as frequency-transformed signals.
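As a rough illustration of step 203, the following sketch applies a Hann window and a naive discrete Fourier transform to one frame per microphone, yielding M complex-valued vectors of length K=T. The function names, the choice of a Hann window and the direct O(T²) transform are assumptions made for readability; a practical system would use an FFT.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft_frame(frame):
    """Naive DFT of one Hann-windowed frame; returns K = T complex bins."""
    t = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(t))]
    return [
        sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / t) for n in range(t))
        for k in range(t)
    ]


def transform_array(frames_per_mic):
    """One complex vector per microphone for a single frame time-index."""
    return [dft_frame(frame) for frame in frames_per_mic]


T = 32  # tiny frame so the naive transform stays fast
frame = [math.cos(2.0 * math.pi * 4 * n / T) for n in range(T)]  # tone at bin 4
spectra = transform_array([frame, frame])  # M = 2 microphones
peak_bin = max(range(T // 2 + 1), key=lambda k: abs(spectra[0][k]))
print(len(spectra), len(spectra[0]), peak_bin)
```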

The speech signals are clustered to different speakers in 204. The clusters may be referred to as speaker related clusters. Unlike prior art works which cluster speakers based on direction only, 204 deals with multi-speakers in a reverberant room, so that signals from different directions can be assigned to the same speaker due to the direct paths and the indirect paths. The proposed solution suggests using a set of acoustic cues, for example, the pitch frequency and intensity, and its harmonics frequencies and intensities, on top of a set of spatial cues, for example the direction (azimuth and elevation) and the intensity of the signal in one of the microphones. The pitch and one or more of the spatial cues serve as the state vector for a tracking algorithm such as a Kalman filter and its variants, multiple hypothesis tracking (MHT) or a particle filter, which are used to track this state vector, and to assign each track to a different speaker.

All these tracking algorithms use a model which describes the dynamics of the state vector in time, so that, when measurements of the state vector are missing or corrupted by noise, the tracking algorithm compensates for this using the dynamic model, and simultaneously updates the model parameters. The output of this stage is a vector, assigning each frequency component at a given time ℓ to each speaker. 204 is further elaborated in FIG. 3.

An RTF estimator is applied in 205 to the data in the frequency domain. The result of this stage is a set of RTFs, each registered to the associated speaker. The registration process is done using the clustering array from the clustering speakers 204. The set of RTFs is also referred to as speakers related relative transfer functions.

The MIMO beamformer 206 reduces the energy of the noise and of the interfering signals with respect to the energy of the required speech signal by means of spatial filtering. The output of step 206 may be referred to as beamformed signals. The beamformed signals are then forwarded to the inverse frequency transform 207 to create a continuous speech signal in the form of a stream of samples, which is transferred, in turn, to other elements such as speech recognition, communication systems and recording devices 208.

In a preferred embodiment of the invention, a keyword spotting 209 can be used to improve the performance of the clustering block 204. The frames from 202 are searched for a pre-defined keyword (for example “hello Alexa”, or “ok Google”). Once the keyword is spotted in the stream of frames, the acoustic cues of the speaker are extracted, such as the pitch frequency and intensity and its harmonics frequencies and intensities. Also, the features of the paths over which each frequency component has arrived at the microphone array 201 are extracted. These features are used by the clustering speaker 204 as a seed for the cluster of the desired speaker. A seed is an initial guess as to the initial parameters of the cluster, for example, the cluster's centroid, radius and statistics for centroid-based clustering algorithms such as K-means, PSO and 2 KPM. Another example is the bases of the subspace for subspace-based clustering.

FIG. 3 describes the clustering algorithm of speakers. It is assumed that each speaker is endowed with a different set of acoustic cues, for example, pitch frequency and intensity and its harmonics frequencies and intensities, so that the set of acoustic cues is a bijective indicator to a speaker. Acoustic cues detection is known to be a challenging task, especially in a noisy, reverberant multi-speaker environment. To address this challenge, the spatial cues, for example, in the form of the signal intensity, the azimuth angle and the elevation angle, are used. The acoustical cues are tracked over time using filters such as a particle filter and an extended Kalman filter (EKF) to overcome temporary inactive speakers and changes in acoustic cues, and the spatial cues are used to segment the frequency components among different sources. The results of the EKF and the segmentation are combined by means of cross-correlation to facilitate the clustering of the frequency components to a specific speaker with a specific pitch.

In 31, potential acoustic cues in the form of pitch frequencies are detected, as an example of one preferred embodiment. First, a time-frequency map is prepared using the frequency transform of the buffers from each microphone, which are computed in 203. Next, the absolute values of the M K-long complex-valued vectors are weight-averaged, with weight factors which can be determined so as to diminish artifacts in some of the microphones. The result is a single K-long real vector. In this vector, values higher than a given threshold are extracted, while the rest of the elements are discarded. The threshold is often selected adaptively as being three times the standard deviation of the noise, but no less than a constant value which depends on the electrical parameters of the system, and especially on the number of effective bits of the sampled signal. Values with frequency index within the range of [k_min, k_max] are defined as candidates for pitch frequencies. Variables k_min and k_max typically correspond to 85 Hz and 2550 Hz respectively, as a typical adult male will have a fundamental frequency from 85 to 1800 Hz, and that of a typical adult female from 165 to 2550 Hz. Each pitch candidate is then verified by searching for its higher harmonics. The existence of the 2^(nd) and 3^(rd) harmonics may be a prerequisite for a candidate pitch to be detected as a legitimate pitch with reliability R (say, R=10). If higher harmonics (e.g., 4^(th) and 5^(th)) exist, the reliability of the pitch may be increased, for example doubled for each harmonic. An example can be found in FIG. 4. In a preferred embodiment of the invention, the pitch of the desired speaker 32 is supplied by 210 using a keyword that was uttered by the desired speaker. The supplied pitch 32 is added to the list with the highest possible reliability, say R=1000.
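The detection in 31 can be sketched as follows. The sketch works on frequency-bin indices rather than Hz, and the names average_magnitude and detect_pitches, the floor parameter and the toy spectrum are assumptions chosen for illustration; the toy values are picked to mirror the first frame of the FIG. 4 example (a pitch at k=2 with R=20 and a pitch at k=3 with R=10).

```python
def average_magnitude(spectra, weights):
    """Weight-average the absolute values of the M complex vectors into one real vector."""
    total = sum(weights)
    return [
        sum(w * abs(s[k]) for w, s in zip(weights, spectra)) / total
        for k in range(len(spectra[0]))
    ]


def detect_pitches(avg, noise_std, floor, k_min, k_max,
                   base_reliability=10, max_harmonic=5):
    """Return {pitch_bin: reliability} for candidates whose 2nd and 3rd harmonics exist."""
    threshold = max(3.0 * noise_std, floor)  # adaptive threshold with a constant floor

    def present(k):
        return k < len(avg) and avg[k] > threshold

    pitches = {}
    for k in range(max(k_min, 1), min(k_max, len(avg) - 1) + 1):
        if not (present(k) and present(2 * k) and present(3 * k)):
            continue
        reliability = base_reliability
        for h in range(4, max_harmonic + 1):
            if present(h * k):
                reliability *= 2  # doubled for each detected higher harmonic
        pitches[k] = reliability
    return pitches


base = [0.0] * 16
for k in (2, 3, 4, 6, 8, 9):
    base[k] = 1.0
mic1 = [complex(v, 0.0) for v in base]
mic2 = [complex(-v, 0.0) for v in base]  # same magnitudes, opposite phase
avg = average_magnitude([mic1, mic2], [1.0, 1.0])
pitches = detect_pitches(avg, noise_std=0.1, floor=0.05, k_min=2, k_max=5)
print(pitches)
```

Bin k=4 is rejected in this toy spectrum because its 3rd harmonic (k=12) is absent, which is the same reasoning the text applies to the candidate in 46 of FIG. 4.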

In 33, an extended Kalman filter (EKF) is applied to the pitch from 31. As noted by the Wikipedia entry on extended Kalman filters (www.wikipedia.org/wiki/Extended_Kalman_filter), a Kalman filter has a state transition equation and an observation model. The state transition equation, for a discrete calculation, is:

x _(k) =f(x _(k−1) ,u _(k))+w _(k)  (1)

And the observation model, for a discrete calculation, is:

z _(k) =h(x _(k))+v _(k)  (2)

where x_(k) is the state vector which contains parameters which (partially) describe the status of a system, u_(k) is a vector of external inputs which provide information on the status of the system, and w_(k) and v_(k) are the process and observation noises. The time update of the extended Kalman filter may predict the next state with prediction equations, and a detected pitch may update the variables by comparing the actual measurement with the predicted measurement, using the following type of equation:

y _(k) =z _(k) −h(x _(k|k−1))  (3)

where z_(k) is the detected pitch and y_(k) is the error between the measurement and the predicted pitch.

In 33, each trajectory may begin from a detected pitch, followed by a model f (x_(k), u_(k)), reflecting the temporal behavior of the pitch, which might go higher or lower because of emotions. The model's inputs may be past state vectors x_(k) (either one state vector or more), and any external inputs u_(k) which affect the dynamics of the pitch, such as the speed of the speech, intensity of speech and emotional utterances. The elements of the state vector x may quantitatively describe the pitch. For example, a state vector of a pitch might include, inter alia, the pitch frequency, the intensity of the 1^(st) order harmonics, and the frequency and intensity of higher harmonics. The vector function f (x_(k), u_(k)) may be used to predict the state-vector x at some given time k+1 ahead of the current time. An exemplary realization of the dynamic model in the EKF may include the time update equation (a.k.a. prediction equation) as is described in the book “Lessons in Digital Estimation Theory” by Jerry M. Mendel, which is incorporated herein by reference.

Considering, for example, the 3-tuple state-vector:

x _(k)=[f _(k) a _(k) b _(k)]^(T)∈ℝ³  (4)

where f_(k) is the frequency of the pitch (1^(st) harmonic) at time k, a_(k) is the intensity of the pitch (1^(st) harmonic) at time k, and b_(k) is the intensity of the 2^(nd) harmonic at time k.

An exemplary state-vector model for the pitch may be:

x _(k) =x _(k−1)∈ℝ³  (5)

This describes a model which assumes a constant pitch at all times. In a preferred embodiment of the invention, the speed of the speech, intensity of speech and emotional utterances are monitored continuously using speech recognition algorithms as are known in the art, providing external inputs u_(k) which improve the time update stage of the EKF. Emotional utterance methods are known in the art. See, for example, “New Features for Emotional Speech Recognition” by Palo et al.
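For the constant-pitch model of equation (5), with the measurement taken to be the state itself (h(x)=x), the extended Kalman filter reduces to an ordinary linear Kalman filter. The sketch below additionally assumes diagonal covariances so that each element of the state vector is filtered independently; the function names and all numeric values are illustrative assumptions, not values from the described embodiment.

```python
def predict(x, p, q):
    """Time update for x_k = x_{k-1}: the state is kept, the uncertainty grows by q."""
    return list(x), [pi + qi for pi, qi in zip(p, q)]


def update(x_pred, p_pred, z, r):
    """Measurement update with h(x) = x; y is the innovation of equation (3)."""
    x_new, p_new = [], []
    for xi, pi, zi, ri in zip(x_pred, p_pred, z, r):
        y = zi - xi                  # measurement minus predicted measurement
        gain = pi / (pi + ri)        # scalar Kalman gain per state element
        x_new.append(xi + gain * y)
        p_new.append((1.0 - gain) * pi)
    return x_new, p_new


# State [f_k, a_k, b_k]: pitch frequency, pitch intensity, 2nd-harmonic intensity.
x, p = [120.0, 1.0, 0.5], [25.0, 1.0, 1.0]
q, r = [1.0, 0.01, 0.01], [4.0, 0.1, 0.1]

x_pred, p_pred = predict(x, p, q)
x_new, p_new = update(x_pred, p_pred, [124.0, 1.1, 0.4], r)
print(x_new, p_new)

# When no pitch is detected in a frame, only the time update is applied.
x_coast, p_coast = predict(x_new, p_new, q)
```

External inputs u_(k) such as speech speed or intensity would enter through a richer model f; they are omitted here to keep the example minimal.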

Each track is endowed with a reliability field which is inversely proportional to the time over which the track evolves using the time update only. When the reliability of a track goes below some reliability threshold ρ, say, representing 10 seconds of undetected pitch, the track is defined as dead, which means that the respective speaker is not active. On the other hand, when a new measurement (pitch detection) appears which cannot be assigned to any of the existing tracks, a new track is initiated.
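The track life cycle described above may be sketched as follows. The reliability formula 1/(1+coast time), the nearest-neighbour assignment and the gate_hz tolerance are assumptions chosen for the example; the text only states that reliability is inversely proportional to the time spent on time updates alone and that unassigned detections start new tracks.

```python
from dataclasses import dataclass


@dataclass
class Track:
    pitch_hz: float
    coast_s: float = 0.0   # time evolved using the time update only
    alive: bool = True

    @property
    def reliability(self):
        return 1.0 / (1.0 + self.coast_s)


def step_tracks(tracks, detections_hz, frame_s, rho, gate_hz=10.0):
    """Advance all tracks by one frame and open tracks for unassigned detections."""
    unused = list(detections_hz)
    for track in tracks:
        if not track.alive:
            continue
        match = min(unused, key=lambda d: abs(d - track.pitch_hz), default=None)
        if match is not None and abs(match - track.pitch_hz) <= gate_hz:
            track.pitch_hz = match      # measurement available: reset the coast time
            track.coast_s = 0.0
            unused.remove(match)
        else:
            track.coast_s += frame_s    # time update only
            if track.reliability < rho:
                track.alive = False     # the respective speaker is considered inactive
    tracks.extend(Track(d) for d in unused)
    return tracks


frame_s = 1024 / 16000          # one new frame every 1,024 samples at 16 kHz
rho = 1.0 / (1.0 + 10.0)        # threshold representing 10 seconds without detection
tracks = step_tracks([Track(120.0)], [122.0, 210.0], frame_s, rho)
for _ in range(200):            # 12.8 seconds in which only the 210 Hz pitch is seen
    step_tracks(tracks, [210.0], frame_s, rho)
print([(t.pitch_hz, t.alive) for t in tracks])
```

The greedy assignment used here can hand a detection to the wrong track when pitches are close; multiple hypothesis tracking, mentioned earlier, is one way to handle such ambiguity.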

In 34, the spatial cues are extracted from the M frequency-transformed frames. As in 31, the recent L vectors are saved for analysis using correlation in time. The result is a time-frequency-Cue (TFC) map, which is a 3-dimensional array of size L×K×P (where P=M−1) for each of the M microphones. The TFC is described in FIG. 5.

In 35, the spatial cues of each frequency component in the TFC are segmented. The idea is that along the L frames, a frequency component might originate from different speakers, and this can be observed by comparing the spatial cues. It is assumed, however, that at a single frame time l, the frequency component originates from a single speaker, owing to the W-DO assumption. The segmentation can be performed using any method known in the literature which is used for clustering, such as K nearest neighbors (KNN). The clustering assigns an index c(k,l)∈ℕ to each cell in A, which indicates to which cluster the cell (k,l) belongs.
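As one possible illustration of the segmentation in 35, the sketch below clusters a single spatial cue (azimuth, in degrees) of one frequency line over L frames. It uses a small one-dimensional K-means rather than KNN, with hand-picked initial centroids; both choices, the function name and the sample values are assumptions made for the example, and angle wrap-around is ignored.

```python
def kmeans_1d(values, initial_centroids, iterations=20):
    """Assign each value to the nearest centroid and refine the centroids."""
    centroids = list(initial_centroids)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid when a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Azimuth of one frequency component over L = 6 frames.
azimuth_deg = [10.0, 12.0, 11.0, 80.0, 82.0, 79.0]
labels, centroids = kmeans_1d(azimuth_deg, [0.0, 90.0])
print(labels, centroids)
```

The resulting labels play the role of the cluster index for that frequency line: frames sharing a label are taken to come from the same source.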

In 36, the frequency components of the signals are grouped such that each frequency component is assigned to a specific pitch in the list of pitches which are tracked by the EKF and is active by its reliability. This is done by computing the sample cross-correlation between the k^(th) line of the time-frequency map (see FIG. 4), which is assigned to one of the pitches, with all the values with a specific cluster index c₀(j,l) in other lines in the time-frequency map. This is done for every cluster index. The sample cross-correlation is given by:

$\begin{matrix}{{R\left( {k,j,c_{0}} \right)} = {\frac{1}{L}{\sum\limits_{\underset{{c{({j,l})}} = c_{0}}{l = 0}}^{L - 1}{{A\left( {k,l} \right)} \cdot {A\left( {j,l} \right)}}}}} & (6)\end{matrix}$

where A is the time-frequency map, k is the index of the line belonging to one of the pitches, j is any other line of A and L is the number of columns of A. After computing the sample cross-correlation between each pitch and each of the clusters in other lines, the cluster c₁ in line j₁ with the highest cross-correlation is grouped with the respective pitch, and then the cluster c₂ in line j₂ with the second highest cross-correlation is grouped with the respective pitch, and so forth. This process is repeated until the sample cross-correlation goes below some threshold κ which can be set adaptively as, say, 0.5×(the average energy of the signal at a single frequency). The result of 36 is a set of groups of frequencies endowed with the respective pitch frequency.
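Equation (6) and the grouping loop can be sketched directly. Here a[k][l] stands for the time-frequency map A and cluster[j][l] for the cluster index c(j,l); the function names and the toy 3×4 map are assumptions made for illustration.

```python
def sample_cross_correlation(a, cluster, k, j, c0):
    """R(k, j, c0) of equation (6): correlate line k with the cells of line j in cluster c0."""
    num_frames = len(a[k])
    return sum(
        a[k][l] * a[j][l] for l in range(num_frames) if cluster[j][l] == c0
    ) / num_frames


def group_with_pitch(a, cluster, k, kappa):
    """Return (line, cluster) pairs ordered by decreasing correlation with pitch line k."""
    scores = []
    for j in range(len(a)):
        if j == k:
            continue
        for c0 in sorted(set(cluster[j])):
            scores.append((sample_cross_correlation(a, cluster, k, j, c0), j, c0))
    scores.sort(reverse=True)
    # Stop once the correlation falls below the threshold kappa.
    return [(j, c0) for r, j, c0 in scores if r >= kappa]


a = [
    [1.0, 1.0, 0.0, 0.0],   # line k = 0, assigned to a tracked pitch
    [2.0, 2.0, 1.0, 1.0],   # line 1: first two frames move together with the pitch line
    [0.0, 0.0, 3.0, 3.0],   # line 2: active only when the pitch line is silent
]
cluster = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
groups = group_with_pitch(a, cluster, k=0, kappa=0.5)
print(sample_cross_correlation(a, cluster, 0, 1, 0), groups)
```

In this toy map only cluster 0 of line 1 is grouped with the pitch, since its cells are the only ones that are active in the same frames as the pitch line.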

FIG. 4 describes an example of the pitch detection over the time-frequency map. 41 is the time axis, which is denoted by the parameter ℓ, and 42 is the frequency axis which is described by the parameter k. Each column in this 2-dimensional array is the K-long real valued vector extracted in 31 after averaging the absolute value of the M frequency transformed buffers at time ℓ. For the correlation analysis in time, the L recent vectors are saved in a 2-dimensional array of size K×L. In 43 two pitches are denoted by diagonal lines at different directions. The pitch k=2 with its harmonics at k=4,6,8 has reliability R=20 because of the existence of the 4^(th) harmonic, and the pitch at k=3 with its harmonics at k=6,9 has reliability R=10. In 44 the k=3 pitch is inactive, and only k=2 is active. However, the reliability of the k=2 pitch is decreased to R=10 as the 4^(th) harmonic is not detected (below the threshold μ). In 45 the pitch of k=3 is active again and the k=2 pitch is inactive. In 46 a new pitch candidate at k=4 emerges, but only its 2^(nd) harmonic is detected. Therefore, it is not detected as a pitch. In 47 the k=3 pitch is inactive and no pitch is detected.

FIG. 5 describes the TFC-map, whose axes are the frame index (time) 51, the frequency component 52 and the spatial cues 53, which might be, for example, a complex value expressing the direction (azimuth and elevation) from which each frequency component arrives, and the intensity of the component. When the frames with index $\ell$ are processed and transferred to the frequency domain, a vector of M complex numbers is received for each frequency element $\{k\}_{k=0}^{K-1}$. From each vector, up to M−1 spatial cues are extracted. In the example of direction and intensity of each frequency component, this might be done using any direction-finding algorithm for array processing known in the art, such as MUSIC or ESPRIT. The result of this algorithm is a set of up to M−1 directions in 3-dimensional space, each expressed by two angles and the estimated intensity of the arriving signal

${{p_{p}\left( {\ell,k} \right)}\overset{\Delta}{=}\left( {{a\left( {\ell,k} \right)},{\theta \left( {\ell,k} \right)},{\varphi \left( {\ell,k} \right)}} \right)_{p}},$

p=1, . . . , P≤M−1. The cues are arranged in the TFC-map such that $p_{p_{0}}(\ell_{0},k_{0})$ is stored at the cell indexed by $(\ell_{0},k_{0},p_{0})$.
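One possible in-memory layout of the TFC-map is sketched below, purely for illustration. The names `SpatialCue`, `make_tfc_map` and `store_cues` are hypothetical, the cue index p is zero-based here (the text counts from 1), and empty cells are marked with `None`; none of these choices is prescribed by the description.

```python
from typing import List, NamedTuple, Optional


class SpatialCue(NamedTuple):
    """One 3-tuple (a, theta, phi): intensity, azimuth and elevation."""
    intensity: float
    azimuth: float
    elevation: float


TfcMap = List[List[List[Optional[SpatialCue]]]]


def make_tfc_map(num_frames: int, num_bins: int, max_cues: int) -> TfcMap:
    """Allocate an empty map indexed as tfc_map[frame][bin][cue]."""
    return [[[None] * max_cues for _ in range(num_bins)]
            for _ in range(num_frames)]


def store_cues(tfc_map: TfcMap, frame: int, k: int,
               cues: List[SpatialCue]) -> None:
    """Store up to max_cues cues for frequency bin k of the given frame."""
    capacity = len(tfc_map[frame][k])
    for p, cue in enumerate(cues[:capacity]):
        tfc_map[frame][k][p] = cue


# M = 4 microphones allow up to M - 1 = 3 cues per time-frequency cell.
tfc_map = make_tfc_map(num_frames=2, num_bins=5, max_cues=3)
store_cues(tfc_map, frame=1, k=2,
           cues=[SpatialCue(0.7, 0.35, 0.10), SpatialCue(0.2, -1.20, 0.05)])
```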

APPENDIX

The performance of speech enhancement modules depends upon the ability to filter out all the interference signals, leaving only the desired speech signals. Interference signals might be, for example, other speakers, noise from air conditioners, music, motor noise (e.g. in a car or airplane) and large crowd noise, also known as ‘cocktail party noise’. The performance of speech enhancement modules is normally measured by their ability to improve the speech-to-noise ratio (SNR) or the speech-to-interference ratio (SIR), which reflect the ratio (often in dB scale) of the power of the desired speech signal to the total power of the noise and of other interfering signals, respectively.

When the acquisition module contains a single microphone, the methods are termed single-microphone speech enhancement and are often based on the statistical features of the signal itself in the time-frequency domain, such as single-channel spectral subtraction, spectral estimation using minimum variance distortionless response (MVDR) and echo cancellation. When more than a single microphone is used, the acquisition module is often termed a microphone array, and the methods are termed multi-microphone speech enhancement. Many of these methods exploit the differences between the signals captured simultaneously by the microphones. A well-established method is beamforming, which sums the signals from the microphones after multiplying each signal by a weighting factor. The objective of the weighting factors is to average out the interference signals so as to condition the signal of interest.

Beamforming, in other words, is a way of creating a spatial filter which algorithmically increases the power of a signal emitted from a given location in space (the desired signal from the desired speaker), and decreases the power of signals emitted from other locations in space (interfering signals from other sources), thereby increasing the SIR at the beamformer output.

In a delay-and-sum beamformer (DSB), the weighting factors are composed of the counter delays implied by the different paths along which the desired signal travels from its source to each of the microphones in the array. DSB is limited to signals that each come from a single direction, such as in free-field environments. Consequently, in reverberant environments, in which signals from the same source travel along different paths to the microphones and arrive at each microphone from a plurality of directions, DSB performance is typically insufficient.
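A delay-and-sum beamformer for a single frequency bin can be sketched as below. The sketch rests on assumptions that are not part of the description: a uniform linear array, far-field plane waves, a speed of sound of 343 m/s, and the illustrative helper names `steering_delays`, `plane_wave` and `delay_and_sum`.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_delays(mic_positions, angle_rad):
    """Relative arrival delay (s) of a far-field plane wave at each microphone.
    Positions are in metres along the array axis; the angle is measured from it."""
    return [-x * math.cos(angle_rad) / SPEED_OF_SOUND for x in mic_positions]


def plane_wave(s, delays, freq_hz):
    """One frequency component of a source s as seen by each microphone."""
    return [s * cmath.exp(-2j * math.pi * freq_hz * tau) for tau in delays]


def delay_and_sum(z, delays, freq_hz):
    """Apply the counter delays as phase weights and average the channels."""
    weighted = (cmath.exp(2j * math.pi * freq_hz * tau) * zm
                for zm, tau in zip(z, delays))
    return sum(weighted) / len(z)


mics = [0.0, 0.05, 0.10, 0.15]           # 4 microphones, 5 cm apart
freq = 1000.0                            # Hz
look = steering_delays(mics, math.radians(60))
other = steering_delays(mics, math.radians(120))

target_out = delay_and_sum(plane_wave(1.0, look, freq), look, freq)
interferer_out = delay_and_sum(plane_wave(1.0, other, freq), look, freq)
```

The component arriving from the look direction passes with unit gain, while the component arriving from the other direction is attenuated because its phases no longer align. With several reflected paths per source, no single set of counter delays aligns all of them, which is the limitation noted above.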

To mitigate the drawbacks of DSB in reverberant environments, beamformers may use a more complicated acoustic transfer function (ATF), which represents the direction (azimuth and elevation) from which each frequency component arrives at a specific microphone from a given source. A single direction of arrival (DOA), which is assumed by DSB and other DOA-based methods, often does not hold true in reverberant environments, where the components of the same speech signal arrive from different directions. This is because of the different frequency responses of physical elements in a reverberant environment, such as walls, furniture, and people. The ATF in the frequency domain is a vector assigning a complex number to each frequency in the Nyquist bandwidth. The absolute value represents the gain of the path related to this frequency, and the phase indicates the phase which is added to the frequency component along the path.

Estimating the ATF between a given point in space and a given microphone may be done by using a loudspeaker positioned at the given point and emitting a known signal. By taking simultaneously the signals from the input of the loudspeaker and the output of the microphone, one can readily estimate the ATF. The loudspeaker may be situated at one or more positions where human speakers might reside during the operation of the system. This method creates a map of ATFs for each point in space, or more practically, for each point on a grid. ATFs of points not included in the grid are approximated using interpolation. Nevertheless, this method suffers from major drawbacks. First, the need to calibrate the system for each installation makes this method impractical. Second, the acoustic difference between a human speaker and an electronic loudspeaker causes the measured ATF to deviate from the actual one. Third, measuring a huge number of ATFs is complex, especially when considering also the direction of the speaker. Fourth, errors are possible due to changes of the environment.
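The text does not say which estimator is used to obtain the ATF from the loudspeaker input and the microphone output. One common choice, shown here only as an assumption, is the per-frequency least-squares ratio of the cross-spectrum to the input power spectrum, averaged over frames. The function name `estimate_atf` and the toy spectra are illustrative.

```python
def estimate_atf(X_frames, Y_frames):
    """Least-squares ATF estimate per frequency bin (assumed estimator).

    X_frames[l][k] is the spectrum of the known loudspeaker input and
    Y_frames[l][k] the spectrum of the microphone output, frame l, bin k.
    """
    num_bins = len(X_frames[0])
    atf = []
    for k in range(num_bins):
        cross = sum(Y[k] * X[k].conjugate() for X, Y in zip(X_frames, Y_frames))
        power = sum(abs(X[k]) ** 2 for X in X_frames)
        atf.append(cross / power if power > 0 else 0j)
    return atf


# Toy check: synthesise microphone spectra from a known ATF and recover it.
true_atf = [0.5 + 0.5j, 2 + 0j, -1j]
X_frames = [[1 + 1j, 2 - 1j, 0.5j],
            [-1 + 0.5j, 1j, 3 + 0j]]
Y_frames = [[h * x for h, x in zip(true_atf, X)] for X in X_frames]
estimated_atf = estimate_atf(X_frames, Y_frames)
```

The absolute value of each estimated entry is the path gain at that frequency and its argument is the added phase, matching the interpretation of the ATF given earlier.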

A more practical alternative to the ATF is the relative transfer function (RTF), which remedies the disadvantages of ATF estimation methods in practical applications. The RTF is the difference between the ATFs from a given source to two of the microphones in the array, which in the frequency domain takes the form of the ratio between the spectral representations of the two ATFs. Like the ATF, the RTF in the frequency domain assigns a complex number to each frequency. The absolute value is the gain difference between the two microphones, which is often close to unity when the microphones are close to each other, and the phase, under some conditions, reflects the incident angle of the source.

A transfer function based linear constrained minimum variance (TF-LCMV) beamformer may reduce noise while limiting speech distortion in multi-microphone applications, by minimizing the output energy subject to the constraint that the speech component in the output signal is equal to the speech component in one of the microphone signals. Given $N=N_{d}+N_{i}$ sources, consider the problem of extracting $N_{d}$ desired speech sources, contaminated by $N_{i}$ interfering sources and a stationary noise. Each of the involved signals propagates through the acoustic medium before being picked up by an arbitrary array comprising M microphones. The signal of each microphone is segmented into frames of length T and an FFT is applied to each frame. In the frequency domain, let us denote the k-th frequency component of the $\ell$-th frame of the m-th microphone and of the n-th source by $z_{m}(\ell,k)\in\mathbb{C}$ and $s_{n}(\ell,k)\in\mathbb{C}$, respectively. Similarly, the ATF between the n-th source and the m-th microphone is $g_{m,n}(\ell,k)$, and the noise at the m-th microphone is $v_{m}(\ell,k)$. The received signal in matrix form is given by:

$\begin{matrix}{z(\ell,k) = G(\ell,k)\,s(\ell,k) + v(\ell,k) \in \mathbb{C}^{M}} & (7)\end{matrix}$

Where $z(\ell,k)=[z_{1}(\ell,k),\ldots,z_{M}(\ell,k)]^{T}\in\mathbb{C}^{M}$ is the sensor vector, $s(\ell,k)=[s_{1}(\ell,k),\ldots,s_{N}(\ell,k)]^{T}\in\mathbb{C}^{N}$ is the sources vector, $G(\ell,k)\in\mathbb{C}^{M\times N}$ is the ATFs matrix such that $[G(\ell,k)]_{m,n}=g_{m,n}(\ell,k)\in\mathbb{C}$, and $v(\ell,k)=[v_{1}(\ell,k),\ldots,v_{M}(\ell,k)]^{T}\in\mathbb{C}^{M}$ is an additive stationary noise, uncorrelated with any of the sources. Equivalently, (7) can be formulated using the RTFs. Without loss of generality, the RTF of the n-th speech source $h_{m,n}(\ell,k)\in\mathbb{C}$ can be defined as the ratio between the n-th speech component at the m-th microphone and its respective component at the first microphone, i.e., $h_{m,n}(\ell,k)=g_{m,n}(\ell,k)/g_{1,n}(\ell,k)$. The signal in (7) can be formulated using the RTFs matrix $H(\ell,k)\in\mathbb{C}^{M\times N}$, such that $[H(\ell,k)]_{m,n}=h_{m,n}(\ell,k)\in\mathbb{C}$, in vector notation:

$\begin{matrix}{z(\ell,k) = H(\ell,k)\,x(\ell,k) + v(\ell,k) \in \mathbb{C}^{M}} & (8)\end{matrix}$

Where $x_{n}(\ell,k)=g_{1,n}(\ell,k)\,s_{n}(\ell,k)\in\mathbb{C}$ is the altered source signal.
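The equivalence of the ATF form (7) and the RTF form (8) can be checked numerically for a single time-frequency point. The values below are arbitrary and the helper names `rtf_matrix` and `mix` are illustrative; the sketch only assumes that the ATFs to the first (reference) microphone are non-zero.

```python
def rtf_matrix(G):
    """RTFs relative to the first microphone: h[m][n] = g[m][n] / g[0][n]."""
    return [[G[m][n] / G[0][n] for n in range(len(G[0]))]
            for m in range(len(G))]


def mix(A, s, v):
    """Return A s + v for an M x N matrix A, N source terms s, M noise terms v."""
    return [sum(a * sn for a, sn in zip(row, s)) + vm
            for row, vm in zip(A, v)]


# M = 3 microphones, N = 2 sources, one time-frequency point (assumed values).
G = [[0.8 + 0.2j, 0.5 - 0.4j],
     [0.6 - 0.3j, 0.7 + 0.1j],
     [0.4 + 0.5j, 0.3 - 0.6j]]
s = [1.0 + 2.0j, -0.5 + 0.25j]
v = [0.01j, -0.02 + 0j, 0.03 + 0.01j]

H = rtf_matrix(G)
x = [G[0][n] * s[n] for n in range(len(s))]  # altered source signals

z_atf = mix(G, s, v)  # eq. (7)
z_rtf = mix(H, x, v)  # eq. (8)
```

Both forms give the same sensor vector, and the first row of H is all ones because every RTF is taken relative to the first microphone.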

There is a need to estimate the mixture of the $N_{d}$ desired sources, given the array measurements $z(\ell,k)$. The extraction of the desired signals can be accomplished by applying a beamformer $w(\ell,k)\in\mathbb{C}^{M}$ to the microphone signals, $y(\ell,k)=w^{H}(\ell,k)\,z(\ell,k)\in\mathbb{C}$. Assuming M≥N, $w(\ell,k)\in\mathbb{C}^{M}$ can be chosen to satisfy the LCMV criterion:

$\begin{matrix}{w(\ell,k) = \underset{w}{\arg\min}\left\{ w^{H}(\ell,k)\,\Phi_{vv}(\ell,k)\,w(\ell,k) \right\}} & (9) \\{\text{s.t.}\quad H^{H}(\ell,k)\,w(\ell,k) = c(\ell,k)} & \;\end{matrix}$

where $\Phi_{vv}(\ell,k)\in\mathbb{C}^{M\times M}$ is the power spectral density (PSD) matrix of $v(\ell,k)$ and $c(\ell,k)\in\mathbb{C}^{N\times 1}$ is the constraint vector.

A possible solution to (9) is:

$\begin{matrix}{w_{LCMV}(\ell,k) = \Phi_{vv}^{-1}(\ell,k)\,H(\ell,k)\left( H^{H}(\ell,k)\,\Phi_{vv}^{-1}(\ell,k)\,H(\ell,k) \right)^{-1} c(\ell,k)} & (10)\end{matrix}$

Based on (7) and (8) and the set of constraints, the component of the desired signals at the beamformer output is given by $d(\ell,k)=c^{H}(\ell,k)\,x(\ell,k)\in\mathbb{C}$; that is, the output of the beamformer is a mixture of the components of the desired signals as measured by the first (reference) microphone.
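The closed form (10) can be evaluated for one time-frequency point as sketched below. To stay self-contained the sketch uses a small Gauss-Jordan solver instead of a linear-algebra library; the solver, the helper names and the numerical values of H, the noise PSD matrix and c are all assumptions made for illustration, and singular matrices are not handled.

```python
import cmath


def conj_transpose(A):
    """Hermitian transpose of a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(map(complex, A[i])) + list(map(complex, B[i]))
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [value / p for value in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lcmv_weights(Phi_vv, H, c):
    """Eq. (10): w = Phi^-1 H (H^H Phi^-1 H)^-1 c, with c an N x 1 column."""
    Phi_inv_H = solve(Phi_vv, H)                      # M x N
    gram = matmul(conj_transpose(H), Phi_inv_H)       # N x N
    return matmul(Phi_inv_H, solve(gram, c))          # M x 1


# M = 3 microphones, N = 2 sources; keep source 1 and null source 2.
H = [[1 + 0j, 1 + 0j],
     [0.9 * cmath.exp(0.3j), 1.1 * cmath.exp(-0.5j)],
     [0.8 * cmath.exp(0.6j), 1.2 * cmath.exp(-1.0j)]]
Phi_vv = [[1.0, 0.1, 0.0],
          [0.1, 1.0, 0.1],
          [0.0, 0.1, 1.0]]
c = [[1 + 0j], [0j]]

w = lcmv_weights(Phi_vv, H, c)
constraint = matmul(conj_transpose(H), w)             # should equal c

x = [[2 + 1j], [-1 + 0.5j]]                           # altered source signals
z = matmul(H, x)                                      # noiseless eq. (8)
y = matmul(conj_transpose(w), z)[0][0]                # beamformer output
```

With this choice of c, the noiseless output y equals the first altered source signal, i.e. the desired source as measured at the reference microphone, while the second source is nulled.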

From the $\ell$-th set of RTFs and for each frequency component k, a set of up to M−1 sources, with incident angles $\theta_{p}(\ell,k)$, p=1, . . . , P≤M−1, and elevation angles $\varphi_{p}(\ell,k)$, can be extracted using, for example, phase-difference based algorithms, together with the intensity $a_{p}(\ell,k)$ taken from one of the microphones, which is defined as the reference one. These 3-tuples

${p_{p}\left( {\ell,k} \right)}\overset{\Delta}{=}{\left( {{a\left( {\ell,k} \right)},{\theta \left( {\ell,k} \right)},{\varphi \left( {\ell,k} \right)}} \right)_{p} \in {\mathbb{R}}^{3}}$

are often called spatial cues.
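As an illustration of a phase-difference based extraction, the sketch below recovers a single incident angle from the phase of one RTF value. It assumes one dominant source per frequency component, a far-field plane wave, two microphones spaced closer than half a wavelength, and a speed of sound of 343 m/s; the function names are hypothetical. Elevation is omitted because two microphones on a line cannot resolve it.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def incident_angle(rtf, freq_hz, spacing_m):
    """Angle (rad, measured from the array axis) implied by the RTF phase."""
    phase = cmath.phase(rtf)
    cos_theta = phase * SPEED_OF_SOUND / (2.0 * math.pi * freq_hz * spacing_m)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against noisy phases
    return math.acos(cos_theta)


def spatial_cue(z_ref, rtf, freq_hz, spacing_m):
    """(intensity, angle) pair; intensity is taken at the reference microphone."""
    return abs(z_ref), incident_angle(rtf, freq_hz, spacing_m)


# Synthetic RTF of a plane wave arriving at 1.0 rad, 1 kHz, 5 cm spacing.
true_angle = 1.0
freq, spacing = 1000.0, 0.05
rtf = cmath.exp(2j * math.pi * freq * spacing * math.cos(true_angle)
                / SPEED_OF_SOUND)
cue = spatial_cue(0.6 + 0.8j, rtf, freq, spacing)
```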

The TF-LCMV is an applicable method for extracting M−1 speech sources impinging on an array comprising M sensors from different locations in a reverberant environment. However, a necessary condition for the TF-LCMV to work is that the RTFs matrix $H(\ell,k)$, whose columns are the RTF vectors of all the active sources in the environment, is known and available to the TF-LCMV. This requires associating each frequency component with its source speaker.

Several methods may be used to assign sources to signals without supplementary information. A major family of methods is termed blind source separation (BSS), which recovers unknown signals or sources from their observed mixtures. The key weakness of BSS in the frequency domain is that at each frequency, the column vectors of the mixing matrix (estimated by BSS) are permuted randomly, and without knowledge of this random permutation, combining results across frequencies becomes difficult.

BSS may be assisted by pitch information. However, the gender of the speakers is then required a priori.

BSS may be used in the frequency domain, while resolving the ambiguity of the estimated mixing matrix using the maximum-magnitude method, which assigns a specific column of the mixing matrix to the source that corresponds to the maximal element in the vector. Nevertheless, this method depends heavily on the spectral distributions of the sources, as it is assumed that the strongest component at each frequency indeed belongs to the strongest source. However, this condition is often not met, as different speakers might introduce intensity peaks at different frequencies. Alternatively, source activity detection, also known as voice activity detection (VAD), may be used, such that the information on the active source at a specific time is used to resolve the ambiguity in the mixing matrix. The drawback of VAD is that the voice pause cannot be robustly detected, especially in a multi-speaker environment. Also, this method is effective only when no more than a single speaker at a time joins the conversation, requires a relatively long training period, and is sensitive to motion during this period.

The TF-LCMV beamformer may be used, as well as its extended version for a binaural speech enhancement system, together with a binaural cues generator. The acoustic cues are used to segregate speech components from noise components in the input signals. The technique is based on the auditory scene analysis theory, which suggests the use of distinctive perceptual cues to cluster signals from distinct speech sources in a “cocktail party” environment. Examples of primitive grouping cues that may be used for speech segregation include common onsets/offsets across frequency bands, pitch (fundamental frequency), same location in space, temporal and spectral modulation, and pitch and energy continuity and smoothness. However, the underlying assumption of this method is that all the components of the desired speech signals have almost the same direction, that is, almost free-field conditions, apart from the head-shadow effect, which is suggested to be compensated for by using head-related transfer functions. This is unlikely to happen in a reverberant environment.

It should be noted that even when multiple speakers are active simultaneously, the spectral contents of the speakers do not overlap at most of the time-frequency points. This is called W-Disjoint Orthogonality, or briefly W-DO. This can be justified by the sparseness of the speech signal in the time-frequency domain. According to this sparseness, the probability of the simultaneous activity of two speakers at a specific time-frequency point is very low. In other words, in the case of multiple simultaneous speakers, each time-frequency point most likely corresponds to the spectral content of one of the speakers.
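A rough way to gauge how close two signals are to being W-DO is to count the time-frequency points at which both are active. The measure below is a simplified illustration and not a definition taken from the text: the activity threshold and the function name `overlap_fraction` are assumptions.

```python
def overlap_fraction(S1, S2, threshold):
    """Fraction of active time-frequency points where both sources are active.

    S1 and S2 are spectrograms (lists of frames, each a list of bins) of the
    two sources taken separately; a point is active when its magnitude
    exceeds the threshold.
    """
    active = both = 0
    for frame1, frame2 in zip(S1, S2):
        for a, b in zip(frame1, frame2):
            a_on, b_on = abs(a) > threshold, abs(b) > threshold
            if a_on or b_on:
                active += 1
                if a_on and b_on:
                    both += 1
    return both / active if active else 0.0


# Two toy spectrograms with 2 frames and 3 bins each (assumed values).
S1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
S2 = [[0.0, 1.0, 0.0],
      [0.0, 1.0, 1.0]]
overlap = overlap_fraction(S1, S2, threshold=0.5)
```

A value near zero indicates that almost every active point is dominated by a single speaker, which is the property the clustering relies on.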

W-DO may be used to facilitate BSS by defining a specific class of signals which are W-DO to some extent. This requires only first-order statistics, which is computationally economical. Furthermore, an arbitrary number of signal sources can be de-mixed using only two microphones, provided that the sources are W-DO and do not occupy the same spatial positions. However, this method assumes an identical underlying mixing matrix across all frequencies. This assumption is essential for using histograms of the estimated mixing coefficients across different frequencies. However, this assumption often holds true only in free-field and not in a reverberant environment. The extension of this method to the case of multipath is restricted either to negligible energy from the multipath, or to sufficiently smooth convolutive mixing filters so that the histogram is smeared yet maintains a single peak. This assumption, too, does not hold true in reverberant environments, in which the difference between different paths is often too large to create a smooth histogram.

It has been found that the suggested solution performs in reverberant environments and does not have to rely on unnecessary assumptions and constraints. The solution may operate even without a priori information, even without a large training process, even without constraining estimations of the attenuation and the delay of a given source at each frequency to a single point in the attenuation-delay space, even without constraining estimated values of the attenuation-delay values of a single source to create a single cluster, and even without limiting the number of mixed sounds to two.

In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Furthermore, those skilled in the art will recognize that boundaries between the above-described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.

The phrase “may be X” indicates that condition X may be fulfilled. This phrase also suggests that condition X may not be fulfilled. For example, any reference to a system as including a certain component should also cover the scenario in which the system does not include the certain component. For example, any reference to a method as including a certain step should also cover the scenario in which the method does not include the certain step. Yet for another example, any reference to a system that is configured to perform a certain operation should also cover the scenario in which the system is not configured to perform the certain operation.

The terms “including”, “comprising”, “having”, “consisting” and “consisting essentially of” are used in an interchangeable manner. For example, any method may include at least the steps included in the figures and/or in the specification, or only the steps included in the figures and/or the specification. The same applies to the system.

The system may include an array of microphones, a memory unit and one or more hardware processors such as digital signal processors, FPGAs, ASICs, a general-purpose processor programmed to execute any of the above-mentioned methods, and the like. The system may not include the array of microphones but may be fed with sound signals generated by the array of microphones.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality.

Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.

Also for example, the examples, or portions thereof, may be implemented as software or code representations of physical circuitry or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.

Also, the invention is not limited to physical devices or units implemented in non-programmable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as ‘computer systems’.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

The invention may also be implemented in a computer program for running on a computer system, at least including code portions for performing steps of a method according to the invention when run on a programmable apparatus, such as a computer system, or enabling a programmable apparatus to perform functions of a device or system according to the invention. The computer program may cause the storage system to allocate disk drives to disk drive groups.

A computer program is a list of instructions such as a particular application program and/or an operating system. The computer program may for instance include one or more of: a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.

The computer program may be stored internally on a non-transitory computer readable medium. All or some of the computer program may be provided on computer readable media permanently, removably or remotely coupled to an information processing system. The computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc. A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. An operating system (OS) is the software that manages the sharing of the resources of a computer and provides programmers with an interface used to access those resources. An operating system processes system data and user input, and responds by allocating and managing tasks and internal system resources as a service to users and programs of the system. The computer system may for instance include at least one processing unit, associated memory and a number of input/output (I/O) devices. When executing the computer program, the computer system processes information according to the computer program and produces resultant output information via I/O devices.

Any system referred to in this patent application includes at least one hardware component.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

We claim:
 1. A method for speech enhancement, the method comprises: receiving or generating sound samples that represent sound signals that were received during a given time period by an array of microphones; frequency transforming the sound samples to provide frequency-transformed samples; clustering the frequency-transformed samples to speakers to provide speaker related clusters, wherein the clustering is based on (i) spatial cues related to the received sound signals and (ii) acoustic cues related to the speakers; determining a relative transfer function for each speaker of the speakers to provide speakers related relative transfer functions; applying a multiple input multiple output (MIMO) beamforming operation on the speakers related relative transfer functions to provide beamformed signals; inverse-frequency transforming the beamformed signals to provide speech signals.